Abstract

Several research projects around the world are building
grammatically analysed corpora; that is, collections of text
annotated with part-of-speech wordtags and syntax trees. However,
projects have used quite different wordtagging and parsing schemes.
Developers of corpora adhere to a variety of competing models or
theories of grammar and parsing, with the effect of restricting the
accessibility of their respective corpora, and the potential for
collation into a single fully parsed corpus. In view of this
heterogeneity, we have begun to investigate and develop methods of
automatically mapping between the annotation schemes of the most
widely known corpora, thus assessing their differences and improving
their reusability. Annotating a single corpus with the different
schemes allows for comparisons and will provide a rich test-bed for
automatic parsers. Collation of all the included corpora into a
single large annotated corpus will allow a more detailed language
model to be developed for tasks such as speech and handwriting
recognition. This paper focuses on methods of developing mappings
between tagsets and, in particular, the method of automatic
extraction of mappings from corpora tagged with more than one
annotation scheme.

1 Introduction

Many diverse tagged and parsed corpora have been developed. Amongst
the applications of annotated corpora is their use as training sets
for the extraction of models used in speech and handwriting
recognition. Such training sets need to be as large as possible and
there is anecdotal evidence that even the largest on its own is too
small for a general statistical model of higher-level syntactic
structure. As annotating corpora using hand-crafted markup or some
semi-automated process followed by correction by linguistic experts
is slow and expensive (Barkema 93; Leech and Garside 91) it would be
preferable if some other method of building a large annotated corpus
could be found. Existing corpora were not designed to a common
framework of annotations so corpora cannot easily be collated into a
single large training set. The AMALGAM (automatic mapping among
lexico-grammatical annotation models) project was set up to research
ways of mapping between annotation schemes in order to increase the
size of corpus tagged with the schemes included in the project
(Atwell et al 94a; Atwell et al 94b).

We are developing a multi-tagged corpus and a 
multi-treebank, a single text-set annotated with all 
the tagging and parsing schemes we include in the 
mappings. The text-set is the Spoken English Corpus
(SEC), which is already annotated with two
syntax schemes. However, the main deliverable to 
the computational linguistics research community is 
not the SEC-based multi-treebank, but its associated
suite of mappings, which can be used to combine
currently-incompatible syntactic training sets into a 
large unified corpus. Our development of the map- 
ping algorithms aims to distinguish notational from 
substantive differences in the annotation schemes, 
and we will be able to evaluate tagging schemes in 
terms of how well they fit standard statistical lan- 
guage models such as n-pos (Markov) models. 

Although the above description assumes mapping 
between tagsets from monolingual corpora we be- 
lieve the issues extend to multilingual tagsets. The 
tagsets of two languages usually differ in the fea- 
tures they cover. For example French may have tags 
to discriminate gender whereas English does not. 
However, tagsets of English do not necessarily all cover the same
features. For instance, the British component of the International
Corpus of English (Greenbaum 93) has a tagging scheme that accounts
for transitivity of verbs whereas the Lancaster-Oslo/Bergen corpus
(Johansson et al 86) does not (nor do the EAGLES proposals; see
below). We believe that our methods are scalable to mappings between
multilingual tagsets.

2 Related Research 

Corpus-trained statistical language learning techniques have been
successfully applied to a range of problems in computational
linguistics, including part-of-speech wordtagging (Leech et al 83;
Atwell [...]; Atwell 87a), word sense disambiguation and tagging
(Demetriou and Atwell 93; Gale et al 92), learning word classes
(Atwell 87b; Atwell and Drakos 87; Hughes and Atwell 9[...]; Hughes
and Atwell 9[...]), grammar modelling and induction (Atwell 8[...];
[...] and Charniak [...]; Brill [...] et al 92; Atwell 93; Jost and
Atwell 94), grammatical error detection (Atwell 88a; Atwell 90), and
probabilistic parsing (Sampson et al 89; Souter and O'Donoghue 91;
Magerman and Marcus 91; Souter and Atwell 92; Atwell et al 91;
Briscoe and Waegner 92; Black et al 93). Particularly relevant to
AMALGAM is the recent research interest in Machine Translation using
statistical learning techniques for mapping-extraction from parallel
corpora (Brown et al 90; Brown et al 92; Chen et al 9[...]; Wu and
Xia 94).

3 Obtaining Resources

As a development and testing resource, we are using the text of the
Lancaster-IBM Spoken English Corpus (SEC) (Taylor and Knowles 88).
The SEC is a collection of recordings of radio broadcasts with
accompanying annotated transcriptions, collected by Lancaster
University and IBM UK as a general research resource. The SEC is
available from the International Computer Archive of Modern English
(ICAME) based at the Norwegian Computing Centre for the Humanities
(in Bergen, Norway). The corpus exists in several forms and
annotations: the digitised acoustic waveform; the graphemic
transcription annotated with prosodic markings; and a part-of-speech
analysis that was annotated semi-automatically with the aid of CLAWS
(Atwell 85; Leech et al 83) as used for the LOB corpus. Skeletal
parsing has been added to create the SEC Treebank, and this forms a
subset of the Lancaster-IBM Treebank. Gerry Knowles (Lancaster) and
Peter Roach (Reading, formerly of Leeds) collaborated in an
ESRC-funded project, MARSEC, to set up a time-aligned database of
recorded speech, accompanied by phonetic and graphemic
transcriptions (Knowles 93). Our proposal will produce, as a
side-effect, several alternative tagged and parsed versions of the
SEC which will be made available to the SEC database project
collaborators. It will also be able to act as a test-bed for the
comparison and evaluation of parsing schemes.

Obtaining resources proved to be a stumbling block. Whilst most of
the people in charge of corpus annotation and distribution are
helpful they are also usually very busy! Sometimes there are
reservations about distribution of resources. For example, the
corpus could have copyright restrictions or could be collected for
dictionary compilation. However, we have obtained the following
corpora in tagged or parsed form along with manuals defining the
syntactic annotation schemes: Brown (Francis and Kucera 79), LOB
(Atwell 82; Atwell et al 84; Johansson et al 86), London-Lund
(Svartvik 90), Polytechnic of Wales (Souter 89; Fawcett and Perkins
80) and will apply for the British National Corpus as soon as it
becomes available. We also have the software used for annotating the
University of Pennsylvania corpus (Brill and Marcus 92; Marcus and
Santorini 92) and the International Corpus of English (Greenbaum 93;
Barkema 93).


The following table summarises the resources we have for the six
main corpora we have included in the project so far. The first
column shows whether we have the corpus itself: we have all but the
International Corpus of English. The next column indicates whether
we have the software that was used in the automated part of
annotating the corpus. The next column shows for which corpora we
have documentation giving formal descriptions of the annotation
guidelines. The last column marks the London-Lund and Brown corpora
with a '1' to indicate that we have a small sample of corpus
annotated using both these schemes. The '2' marker in this column
indicates the Parallel Annotated Corpus that we are building at the
moment by adding the International Corpus of English (GB) annotation
to the Spoken English Corpus.



Table 1: Summary of Resources 

4 Deriving Tagset Mappings

When we began the AMALGAM project we antici- 
pated that the following process would be the normal 
way that an annotation scheme was included in our 
'mapping suite': 

1. Develop the most accurate mapping between
the new scheme and one of the schemes already
in the mapping suite. Only one pair needs to be
mapped explicitly as the other mappings can be
generated from intermediaries via an 'interlingua'
approach (Atwell et al 94b).



2. Annotate the Spoken English Corpus using the 
mapping. 

3. Correct the mapped annotation, preferably us- 
ing advice from the people responsible for the 
annotation scheme. 
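
The 'interlingua' idea in step 1 can be illustrated with a minimal sketch: if every scheme is mapped explicitly onto one pivot scheme, a mapping between any two schemes follows by composition. The function name, the pivot labels and the tag fragments below are hypothetical placeholders chosen for illustration, not the project's actual mappings, and real tagsets are rarely one-to-one.

```python
def compose(first, second):
    """Chain two tag mappings: scheme A -> pivot, then pivot -> scheme B.

    Tags whose pivot label has no image in the second mapping are left
    out, so that gaps in coverage stay visible instead of being guessed.
    """
    return {a: second[p] for a, p in first.items() if p in second}


# Hypothetical one-to-one fragments (illustration only).
brown_to_pivot = {"NN": "noun_sing", "NNS": "noun_plur", "WDT": "wh_det"}
pivot_to_ice = {"noun_sing": "N(com,sing)", "noun_plur": "N(com,plu)"}

brown_to_ice = compose(brown_to_pivot, pivot_to_ice)

# Tags that still need an explicit rule or hand correction.
unmapped = sorted(set(brown_to_pivot) - set(brown_to_ice))
print(brown_to_ice, unmapped)
```

With n schemes this needs only n mappings to the pivot rather than one for every pair, at the cost of losing any distinction the pivot cannot express.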

The uneven spread of resources means that alternative mapping
strategies must be adopted when including each annotation scheme
(see Table 1). As we have the software used to tag and parse the
International Corpus of English we can incorporate that into the
mapping. Good formal descriptions of the annotation scheme (such as
for LOB) can be used to craft some rules by hand. Where the
documentation is sparse, rules can be extracted from the corpus
itself.

We require a method to evaluate the alternative mapping strategies.
A simple evaluation can be accomplished by tagging the untagged SEC
using one annotation scheme (the evaluation scheme) by the tried and
tested method of automatic annotation followed by hand correction.
To test a mapping strategy one would apply the mapping to the
evaluation scheme tags to produce CLAWS tags for the SEC text. The
success of the mapping would be determined by measuring the
difference between this annotation and the original SEC (CLAWS
tagged) annotation produced by Lancaster.
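
One simple way to measure that difference is the proportion of tokens on which the mapped tag agrees with the reference tag. The sketch below assumes the two annotations have already been aligned token for token; the function name and the example tags are illustrative assumptions, and the text does not commit to this particular measure.

```python
def mapping_accuracy(source_tags, gold_tags, mapping):
    """Fraction of tokens whose mapped tag equals the reference tag.

    source_tags: tags in the evaluation scheme, one per token
    gold_tags:   reference tags for the same tokens
    mapping:     dict from evaluation-scheme tag to reference-scheme tag
    """
    if len(source_tags) != len(gold_tags):
        raise ValueError("tag sequences must be aligned token for token")
    if not source_tags:
        return 0.0
    # A source tag missing from the mapping counts as a miss.
    hits = sum(mapping.get(s) == g for s, g in zip(source_tags, gold_tags))
    return hits / len(source_tags)


# Hypothetical three-token sample (illustration only).
eval_tags = ["PREP(ge)", "N(com,sing)", "N(prop,sing)"]
claws_tags = ["IN", "NP", "NP"]
candidate = {"PREP(ge)": "IN", "N(com,sing)": "NN", "N(prop,sing)": "NP"}

score = mapping_accuracy(eval_tags, claws_tags, candidate)
print(f"{score:.2%}")
```

Because the reference annotation is itself imperfect, a score computed this way is an agreement rate rather than a true accuracy.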

The Parallel Annotated Corpus (PAC) created 
when a (non-CLAWS) evaluation scheme is used 
to tag the Spoken English Corpus in this way it- 
self provides further possibilities for developing map- 
ping strategies. The PAC may intrinsically encode 
mapping information that would not be uncovered 
from other mapping strategies. Extracting a map- 
ping from a PAC is computationally trivial; the dif- 
ficulty is annotating an existing corpus with a new 
scheme. However, PACs already exist for some pairs of
annotation schemes and this provides an easy way to
extract mapping information. This is particularly
true when the annotation scheme of one corpus is
replaced by another. Initially this would be done using
the automatic annotator of the new scheme followed 
by hand-correction by linguistic experts. However, 
the addition of the new scheme to part of the cor- 
pus creates a PAC from which a mapping can be de- 
rived. The mapping could be used to update the per- 
formance of the automatic annotator. A process of 
refinement of the automatic annotator by feedback 
derived from the mapping would be established. 

This paper focuses on deriving tagset mappings 
from PACs as we are currently in the phase of 
our project where we are concentrating on parts-of- 
speech annotation. However, we anticipate that the 
method will be even more useful when dealing with 
mapping between parse trees. 

5 Extraction of Correspondences 
from Parallel Annotated Corpora 

Although a few PACs already exist only a few tagset 
pairings are covered. Often a corpus is annotated 
with a scheme that the designers feel can be im- 
proved so they annotate the same texts with the up- 
dated scheme. This automatically results in a PAC 
being formed. An example PAC comprises a few sec- 
tions of the Brown corpus that were annotated by 
additional London-Lund markup (Eeg-Olofsson 91). 
A further example is the Nijmegen Corpus which was 
originally annotated with CCPP annotation (Keulen 86)
but later re-annotated with the scheme used to annotate
the British component of the International
Corpus of English (Greenbaum 93). Although the 
Nijmegen TOSCA team now view the CCPP scheme 
as largely obsolete it is still a useful resource for 
mapping extraction as the PAC is 130,000 words in 
length. This provides a large sample from which to 
evaluate alternative mapping strategies. 

To use the method of deriving mappings from 
PACs it is inevitable that some traditional tagging 
is required to build the parallel corpus. As an exam- 
ple of the process of extracting correspondences from 
PACs we shall use the example of the SEC-ICE cor- 
pus. As a PAC does not exist for this pair of tagsets 
we had to build our own. As we aimed to produce 
the multitagged corpus out of the texts of the Spo- 
ken English Corpus it made sense to annotate the 
Spoken English Corpus with ICE tags. 

We employed an experienced annotator of corpora, Tim Willis, to
learn the ICE annotation scheme and apply it to the Spoken English
Corpus by editing the automatic output of the Nijmegen parser, which
was designed to annotate ICE-GB material. For the moment we are
concentrating on deriving mappings between tagged annotation but it
was felt more cost effective to parse and tag the Spoken English
Corpus now as our project will eventually include parse mappings.



The output from the Nijmegen parser (Barkema 93) needs to be aligned
with the markup in the Spoken English Corpus. Problems are caused by
the taggers segmenting text by different methods. Some taggers
convert words not normally capitalised into lowercase, but not all
do. This causes problems when trying to match the words again once
annotation has taken place. The Spoken English Corpus has sentence
boundaries after full stops, exclamation marks and question marks
whereas the Nijmegen parser additionally delimits text separated by
colons and semicolons. The Nijmegen parser and the Spoken English
Corpus tagging scheme deal with enclitics in a similar manner, a
word like who's being split into the separate items who and 's.
Other schemes may leave such words as they are. To be aligned with
the Spoken English Corpus, such a word and its corresponding tag
would have to be split. On the other hand, a proper noun such as New
York may be assigned a single tag and treated as a single item
rather than having the two words treated individually as in the
Spoken English Corpus. The Nijmegen parser does this when producing
parsed output but not when producing tagged output. Some parsers
alter the text they annotate, again making the alignment process
more difficult. A common practice is the removal of capital letters
from words that would not normally have them were they not starting
a sentence. Worse, the item may be transformed altogether. A
semicolon found in the input to the Nijmegen parser is transformed
into the string &semi; as the semicolon on its own would be mistaken
for an SGML marker (Burnard 91). Such issues make alignment a
non-trivial task.
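
Several of these mismatches can be reduced before alignment by converting each token into a comparison key. The sketch below is a minimal illustration under stated assumptions: the entity table contains only the &semi; case mentioned above, the list of enclitic endings is a guess made for the example, and lower-casing every token is a deliberate simplification that is suitable only for matching, not for the final output.

```python
import re

# Assumed entity table; only "&semi;" is mentioned in the text.
ENTITIES = {"&semi;": ";"}

# Assumed enclitic endings, for illustration only.
ENCLITIC = re.compile(r"^(\w+)('s|n't|'ll|'re|'ve|'d|'m)$", re.IGNORECASE)


def normalise(token):
    """Return the list of comparison keys for one token.

    Entities are decoded, enclitics are split into two items,
    and case is folded so that 'The' and 'the' compare equal.
    """
    token = ENTITIES.get(token, token)
    match = ENCLITIC.match(token)
    parts = [match.group(1), match.group(2)] if match else [token]
    return [part.lower() for part in parts]


for example in ["who's", "&semi;", "New"]:
    print(example, "->", normalise(example))
```

Multi-word items such as New York go the other way (one item in one scheme, two in the other) and would need a separate recombination step.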

Figure 1: Alignment of SEC and ICE

To align texts annotated by two schemes we used a method we term
island driven alignment. The 'islands' are the singletons found to
be present in the output of both schemes. The position of these
items can easily be aligned. The words next to the islands can be
examined in turn. Often they will match and so can be aligned
immediately, but occasionally the next pair of items will not match.
Attempting to split enclitics, recombine split compounds or alter
initial letter case may match some pairs but others, such as the
semicolon problem mentioned earlier, will require pattern matching
of the surrounding text. Occasionally an item in one of the
annotations will match with no item in the other, the extra end of
sentence markers in ICE texts being a good example. When this
happens it can only be discovered after aligning the items on either
side of it with neighbouring items in the other annotated output.
The first few lines of the Spoken English Corpus when aligned with
the ICE tags of the same text are shown in Figure 1, above. The
first two columns are the words and CLAWS tags from the tagged SEC
and the remaining column contains the corresponding ICE tags.
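
The core of island driven alignment can be sketched as follows. This is one possible reading of the description, not the project's implementation: 'singletons' are taken to mean word forms that occur exactly once in each sequence, islands that would cross an earlier island are simply discarded, and the only repair attempted while growing outwards is case folding. All function names are hypothetical.

```python
from collections import Counter


def find_islands(a, b):
    """Pair up word forms that occur exactly once in both sequences."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {w: j for j, w in enumerate(b) if count_b[w] == 1}
    islands, last_j = [], -1
    for i, w in enumerate(a):
        j = pos_b.get(w)
        # Greedy filter: skip an island that would cross an earlier one.
        if count_a[w] == 1 and j is not None and j > last_j:
            islands.append((i, j))
            last_j = j
    return islands


def island_align(a, b):
    """Grow alignments outwards from islands while neighbours match."""
    aligned = dict(find_islands(a, b))      # index in a -> index in b
    used_b = set(aligned.values())
    for i, j in list(aligned.items()):
        for step in (-1, 1):                # look left, then right
            x, y = i + step, j + step
            while (0 <= x < len(a) and 0 <= y < len(b)
                   and x not in aligned and y not in used_b
                   and a[x].lower() == b[y].lower()):
                aligned[x] = y
                used_b.add(y)
                x, y = x + step, y + step
    return sorted(aligned.items())


# Hypothetical sample: the second sequence folds case and adds a marker.
sec = ["Good", "morning", "the", "news", "is", "good", "today"]
ice = ["good", "morning", "<s>", "the", "news", "is", "good", "today"]
pairs = island_align(sec, ice)
print(pairs)
```

Items left without a partner, such as the extra marker in the second sequence, only become visible once their neighbours on both sides have been aligned, which matches the behaviour described above.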

The Spoken English Corpus contains the short 
header: (In Perspective) (Rosemary Hill). The pro- 
cess by which ICE was annotated excluded headers 
such as this (they will be tagged by hand). As the 
header is not included in the ICE annotation of the 
text there is nothing to align it to. 

Each pairing of tags can now be counted and a list of
correspondences made for each individual tag to show the
probabilities of each pair. For instance the London-Lund/Brown PAC
produced the list of London-Lund correspondences for the
interrogative wh-determiner tag, WDT, in Brown shown in Figure 2.

           B2deg     2.13%
           BHitr    25.53%
WDT  -->   BRwha     4.26%
           GAwhi    53.19%
           GCwha    14.89%

Figure 2: Correspondences for WDT 
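
Counting the pairings is straightforward once the two annotations are aligned. The sketch below is illustrative: the function name is hypothetical, and the raw counts are an assumption chosen so that they reproduce the percentages of Figure 2 (they sum to 47 tokens); the actual counts are not given in the text.

```python
from collections import Counter, defaultdict


def correspondences(aligned_pairs):
    """Return {tag_a: {tag_b: relative frequency}} from (tag_a, tag_b) pairs."""
    counts = defaultdict(Counter)
    for tag_a, tag_b in aligned_pairs:
        counts[tag_a][tag_b] += 1
    return {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in counts.items()
    }


# Assumed counts (not from the source) that yield the Figure 2 percentages.
wdt_counts = {"B2deg": 1, "BHitr": 12, "BRwha": 2, "GAwhi": 25, "GCwha": 7}
pairs = [("WDT", ll_tag) for ll_tag, n in wdt_counts.items() for _ in range(n)]

table = correspondences(pairs)
for ll_tag, p in sorted(table["WDT"].items()):
    print(f"WDT --> {ll_tag:6s} {p:7.2%}")
```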

The Brown tag WDT pairs with the London-Lund 
tag GAwhi, relative pronoun: which, just over half 
the time in the PAC. The easiest way to convert 
these correspondences into a mapping is to map the 
tag in one scheme always onto the most common 
pairing found in the PAC. Many tags will have a 
1:1 mapping or will pair with one particular tag in 
the other scheme almost all the time. However, the 
above example correspondence list illustrates where 
mapping the most common pairing will work badly. 
We are currently investigating methods of incorpo- 
rating the lexicon (which could be extracted from 
the corpus samples we have, or from the PACs we 
have built ourselves) or using the contextual infor- 
mation supplied by the neighbouring words and tags. 
We also hope to explore methods developed by Brill 
in which texts were first tagged by always select- 
ing the most common tag for a word, and then the 
tag selection refined with a set of automatically ex- 
tracted rewrite rules, or patches (Brill 91). 
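
The most-common-pairing rule, together with a check for the tags on which it is likely to work badly, might look like the sketch below. The threshold is an arbitrary illustrative cut-off, the function name is hypothetical, and the second table row is invented to show a 1:1 case; only the WDT figures come from Figure 2.

```python
def most_common_mapping(table, threshold=0.9):
    """Map each tag to its most frequent partner and flag weak pairings.

    table:     {tag_a: {tag_b: relative frequency}}
    threshold: assumed cut-off below which a mapping is treated as unreliable
    """
    mapping, ambiguous = {}, []
    for tag, row in table.items():
        best = max(row, key=row.get)
        mapping[tag] = best
        if row[best] < threshold:
            ambiguous.append(tag)
    return mapping, sorted(ambiguous)


table = {
    # Relative frequencies taken from Figure 2.
    "WDT": {"B2deg": 0.0213, "BHitr": 0.2553, "BRwha": 0.0426,
            "GAwhi": 0.5319, "GCwha": 0.1489},
    # Hypothetical tag that always pairs with a single partner.
    "XX": {"YY": 1.0},
}

mapping, ambiguous = most_common_mapping(table)
print(mapping, ambiguous)
```

Tags that end up in the ambiguous list are the natural candidates for the lexical or contextual refinements mentioned above, since mapping WDT always onto GAwhi would agree with the PAC only just over half the time.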



6 Lessons for the EAGLES Initiative 

Until recently, very little effort has been expended on 
the development of standards in tagging and pars- 
ing natural language corpora. Individual tagging 
and parsing schemes have been invented more or less 
independently, and differ not only in the linguistic description,
but also in the formalism used to label words or represent tree
structures. (Souter 93) surveys some of the substantive differences
between such formalisms for contemporary parsed corpora of English,
and illustrates how standards are needed to facilitate the
reusability of corpus resources (through enterprises such as the
Text Encoding Initiative), and to improve the general applicability
of corpus-processing software, such as the Nijmegen Linguistic
DataBase (van Halteren and van den Heuvel 90). As many participants
at the workshop will know,
EAGLES is a European initiative to devise a set of
common standards for Natural Language Process- 
ing technology across the range of European Union 
working languages. Of particular relevance to our 
research are the standards proposals for morphosyn- 
tactic wordclasses; a lengthy draft proposal (over 
200 pages) has recently been made available to EL- 
SNET nodes and a number of other centres of ex- 
pertise for comment. The proposals aim to stan- 
dardise a set of wordclasses to be applied to Dan- 
ish, Dutch, English, French, German, Greek, Italian, 
Portuguese, and Spanish; once (or if) agreed, the 
standards may later be extended to cover other
languages (e.g. Swedish, Finnish, Norwegian, Gaelic,
Welsh, Basque, ...). Even among the current EU
main languages, there is considerable diversity in 
morphosyntax, so the EAGLES group are to be con- 
gratulated for achieving a compromise which on the 
face of it is largely uncontentious. EAGLES rec- 
ommends several levels of refinement or delicacy in 
wordclasses, so that specific applications and/or lan- 
guage models are free to select an appropriate level 
of tagset granularity. For example, NOUN is a broad 
(level 1) category, a general class which all language 
models must recognise; within this, there is a level 2 
subdivision into proper nouns and common nouns, 
which will apply to many but not all applications,
and so on. Many other possible wordclass distinctions
are captured by features, e.g. number, gender; some of
these do not apply to certain languages (e.g. gender
of English nouns).
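
The idea of several levels of delicacy can be pictured as a tag that carries a broad class, an optional subclass and optional features, and that can be projected onto a coarser level when an application does not need the detail. The class below is a hypothetical sketch: the field names, the treatment of features as a third level and the labels used are assumptions for illustration, not the notation of the EAGLES proposal.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WordClass:
    major: str                                    # level 1, e.g. NOUN
    subclass: Optional[str] = None                # level 2, e.g. proper / common
    features: Tuple[Tuple[str, str], ...] = ()    # assumed finest level

    def at_level(self, level):
        """Project the tag onto a coarser level of delicacy."""
        if level == 1:
            return WordClass(self.major)
        if level == 2:
            return WordClass(self.major, self.subclass)
        return self


# Illustrative tag for a singular proper noun.
moon = WordClass("NOUN", "proper", (("number", "singular"),))

print(moon.at_level(1))
print(moon.at_level(2))
```

Two tagged corpora can then be compared at the finest level on which both make the distinction, which is only meaningful if both draw the boundary between subclasses in the same place.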

Unfortunately, the divisions between word classes 
and subclasses are made in terms of examples, and 
appeals to linguistic intuition. This is reasonable 
and normal practice in lexicography and language 
teaching; but for computational implementation def- 
initions and boundaries need to be more clearly spec- 
ified. Otherwise, there is a danger that NLP sys- 
tems will adopt wordclass-demarcations on grounds 
of computational tractability, which may not agree 
with the linguistically correct/intuitive definition. 
Worse still, although linguists agree on the general 'common-sense'
definitions of categories like proper noun, common noun etc, our
analysis of competing tagsets for English corpora shows that these
categories are in fact 'fuzzy', and different corpus tagging
projects have adopted subtly but significantly different
definitions, probably unaware that their analyses are incompatible
with those of other linguists. The EAGLES recommendations include a
call to corpus tagging projects to provide their manuals or
tagset-definitions along with the final tagged corpus, but we have
found that, to date, tagging project teams have deemed these
'case-law' handbooks as 'training in progress statements' not worth
publishing, with the notable exception of (Johansson et al 86).



Our earlier example of parallel CLAWS/ICE tagging of the Spoken
English Corpus illustrates the fuzziness in the distinction between
proper noun and common noun. In general, a singular proper noun is
NP in LOB and CLAWS, but N(prop,sing) in ICE. However, notice that
Perspective, the second word in the corpus, is tagged NP. This may
have been because the word begins with a capital, and the tagging
system uses this as a deciding criterion (however, note that the
previous word, In, escapes this default NP tagging because English
text requires the first word of every sentence to start with a
capital, so the tagging system by default converts this to lower
case and tags according to dictionary-lookup). To a linguist, this
analysis of Perspective may intuitively be an 'error'; however there
are no definitions within the EAGLES guidelines which rule out such
counter-intuitive computationally-motivated criteria.
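
The behaviour conjectured above can be reproduced with a very small rule set. The sketch below is only an illustration of how such a criterion would produce the observed tags; it is not the CLAWS algorithm, and the function name, the two-entry lexicon and the default tag are assumptions made for the example.

```python
def tag_sentence(words, lexicon, default="NN"):
    """Tag words with a capital-initial default of NP (toy illustration)."""
    tags = []
    for position, word in enumerate(words):
        if position == 0:
            # A sentence-initial capital is uninformative: fold case, look up.
            tags.append(lexicon.get(word.lower(), default))
        elif word[:1].isupper():
            # Any other capitalised word defaults to proper noun.
            tags.append("NP")
        else:
            tags.append(lexicon.get(word, default))
    return tags


# Hypothetical two-entry lexicon.
lexicon = {"in": "IN", "perspective": "NN"}

print(tag_sentence(["In", "Perspective"], lexicon))
print(tag_sentence(["in", "perspective"], lexicon))
```

The same word receives NP or NN depending only on its initial letter, which is exactly the kind of computationally motivated criterion that a purely example-based standard does not rule out.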

A second example of disagreement over the proper 
and common noun boundary is the analysis of Rev- 
erend Sun Myung Moon - in ICE this is tagged 
as a proper-noun sequence (or rather, a com- 
pound proper-noun single lexical item), but in 
LOB/CLAWS, one fuzzy boundary between com- 
mon and proper nouns is recognised - the area of 
titular nouns tagged NPT (for example, Reverend 
can start with upper or lower case in much the same 
context, so NPT avoids conflicting taggings depending
on the case of the initial letter). Further examples
abound in the parallel corpus; generally the problem
arises from differences in the handling of upper-case
initial letters.

Our conclusion for the EAGLES Initiative is that 
the morphosyntactic category proposals must be fol- 
lowed up with detailed definitions, preferably includ- 
ing computable criteria. In the specific example of 
nouns, there must be clear standards on handling of 
word-initial case. (This is relevant not only to
English.) Otherwise the 'standards' will be interpreted
differently (and incompatibly) in different tagged 
corpora. We had hoped that the EAGLES tagset 
might constitute an 'interlingua' for translating be- 
tween existing tagsets. However, we have already 
had to conclude that our task of automatic tagset- 
mapping extraction can never achieve perfect accu- 
racy, as both source and target training data are 
noisy; using a fuzzy-edged tagset as an interlingua 
could only worsen matters. 